Remarks 

This is in response to the Advisory Action dated 18 July 2006. Applicant 
respectfully requests reconsideration and allowance of the subject Application. 

No Claims are cancelled by this amendment and no Claims are added. 

Claims 30 and 37 are amended. Accordingly, Claims 30-37 are pending 
in this application. 
35 U.S.C. § 103 Claim Rejection 

Claims 30-37 were rejected under 35 U.S.C. § 103(a) as being 
unpatentable over U.S. Patent No. 6,018,805 to Ma et al. (hereinafter "Ma"), in 
view of U.S. Patent No. 6,351,487 to Lu (hereinafter "Lu"). Applicant respectfully 
traverses the rejection. 
Claimed Invention 

Claims 30-37 are directed to methods of, and systems for, recovering 
from a failure of a server to a client. The method maintains the state of the 
connection to the client process layer, and not the state of the process layer. A 
layer of software called "wrapping" surrounds the connection-oriented network 
protocol layer (such as a TCP layer) and intercepts all communication with that 
layer, such as communication originating from a network layer or a process layer. 
The intercepted communications and connection state information associated 
with intercepted communications received from the wrapping layers are logged. 
When it is determined that a connection with the server has failed, the wrapping layers 
respond to the client on behalf of the server based, in part, on the logged 
connection state information. A state of connection associated with the 
connection-oriented layer prior to the failure is restored based, in part, on the 
connection state information received from the wrapping layers. There is no 
need to interpose a proxy or intermediary between the client computer and the 
server computer. Servers using the invention can crash and recover without 
dropping connections. The act of restoring the state of connection associated 
with the connection-oriented layer is invisible to the client. 
References 

The Office cites Ma and Lu in its § 103 rejection of Claims 30-37. 

Ma teaches a recoverable distributed-object application having a client 
object running on a client machine on a network and a first server object on a first 
server machine on the network with an intelligent proxy running on the client 
machine. Col. 2, lines 24-28. Ma also teaches establishing a connection to a 
second server. Col. 4, lines 40-46. When the client communicates with the 
server, a proxy for the server is created on the client machine. Col. 4, lines 36-42. 
The intelligent proxy establishes the new connection to a second server 
when the first server object does not respond. Col. 2, lines 40-45. The system of 
Ma is dependent upon middle-ware layers, skeletons and proxies. Col. 4, lines 
10-11. 

Ma does not teach a method of recovery from a failure of a server to a 
client that does not need to interpose a proxy or intermediary between the client 
and server. (See specification, page 18, and Claims 30 and 37.) Likewise, Ma 
fails to teach responding to the client on behalf of the server by wrapping layers 
and without the need of a second server. 

Lu is directed to a computer system using a first DSL modem to 
communicate packets to a second DSL modem after communication is 
established between the two modems. (Lu, Abstract). Lu is related to DSL 
technology and is directed to a system with a modem device driver. (Lu, Col. 1, 
lines 38-40). Lu teaches NDIS wrapping, which relates to modem drivers but not 
to a process layer. NDIS is a network driver interface specification whose purpose 
is to mediate communication so that a modem device driver does not need to 
communicate directly with a stack driver. (Lu, Col. 17, lines 55-57). The NDIS 
wrapper provides a device driver programming interface allowing multiple 
network protocols to share the same network. (Lu, Col. 17, line 65 to Col. 18, 
line 3). 



Claim Analysis for § 103 Rejection: 

Independent Claim 30 recites: 

A method of recovering from a failure of a server to a client, 
comprising: 

using wrapping layers to intercept communications to a 
connection-oriented protocol layer, the communications originating 
from a network layer and a process layer of a layered 
communications framework, wherein a first wrapping layer is 
interposed between the process layer and the connection-oriented 
protocol layer, and wherein a second wrapping layer is interposed 
between the network layer and the connection-oriented layer; 

logging the intercepted communications and connection 
state information associated with intercepted communications 
received from the wrapping layers; 

determining when a connection with the server fails; 

responding to the client on behalf of the server by the 
wrapping layers based on at least, in part, on the logged connection 
state information; and 

restoring a state of connection associated with the 
connection-oriented layer prior to the failure, based on at least, in 
part, on the connection state information received from the 
wrapping layers, wherein restoring the state of connection 
associated with the connection-oriented layer is invisible to the 
client and wherein recovery from failure of the server by the client 
does not involve interposing a proxy or intermediary. 

Ma fails to teach or suggest the method of Claim 30 because Ma requires 
the use of proxies, which in turn requires the use of a second server. According to the 
teachings of Ma, when the client communicates with the server, a proxy for the 
server is created on the client machine. Col. 4, lines 36-42. The intelligent proxy 
establishes the new connection to a second server when the first server object 
does not respond. Col. 2, lines 40-45. The system of Ma is dependent upon 
middle-ware layers, skeletons and proxies (Col. 4, lines 10-11), whereas the 
present claimed invention (Claim 30) does not need to interpose a proxy or 
intermediary between the client computer and the server computer. 

The Office admits that Ma fails to disclose "proxies being wrappers, and 
wrappers interposed between the network layer and the connection-oriented 
layer." However, the Office contends that Lu teaches wrappers (60, 64) (col. 36-48) 
and also teaches wrappers being interposed between a network layer (54) 
and a connection oriented layer (58) (col. 17, lines 38-48, Fig. 5). The Office also 
contends that the wrappers of the present claimed invention are the proxies 
taught by Ma, asserting that they are used to perform similar functions. 

Applicant respectfully submits that the Office has mischaracterized the art by 
asserting that a proxy is the equivalent of a wrapper and that the two are 
interchangeable. The Office states at paragraph 7 that "Examiner has 
interpreted the proxies taught by Ma to be the wrappers in the instant application 
since they are being used to perform similar functions to Applicant's claimed 
invention." A wrapper is not a proxy and cannot be substituted for a proxy to 
perform a similar function. The two are entirely distinct creatures; they operate 
differently, within entirely different network architectures, and solve different 
problems. 

A wrapper is an interface - it is a layer of software that surrounds the TCP 
layer. In the present claimed invention, two wrapping layers cooperate to 
maintain the current state of the TCP connection. A north side wrapper 
sits between the TCP layer and the application layer, and a south side wrapper 
sits between the TCP layer and the IP layer. 
This approach does not affect the software running on the client, does not require 
the server's TCP implementation to be changed, and does not use a proxy. 

A proxy is an entirely separate application - it is an agent whereby one 
system fronts for another. When a proxy is used to mask connection failures, all 
TCP traffic is redirected between the client and server through the proxy. The 
proxy maintains the state of connection between the client and the server. If the 
server crashes, the proxy switches the connection to an alternate 
server (it does not cooperate to restart the server, as the wrappers of the present invention do). 
A proxy must ensure that sequence numbers of the new connection are 
consistent with the old connection. This approach introduces a single point of 
failure - the proxy. 

Thus, there are significant substantive differences between a wrapper and 
a proxy that would prevent one skilled in the art from interpreting the proxies taught by Ma 
to be the wrappers of the present invention. As evidence to support this 
statement, Applicant cites a research article published by IEEE Infocom in 2001, 
Wrapping Server-Side TCP to Mask Connection Failures, by Alvisi et al., a copy 
of which is submitted herewith for the Examiner's review. The article carefully 
delineates the difference between different methods of recovering a TCP session 
following a crash. One of the methods employs the use of a proxy. The main 
drawback to using a proxy is the introduction of a new single point of failure - the 
proxy. See Alvisi, page 2, second paragraph. Failure of the proxy introduces the 
original problem all over again - reestablishing a connection with the server after 
a failure. The use of wrappers as taught by the present claimed invention does 
not suffer from the drawbacks of other approaches; namely, Applicant's method 
implements a fault-tolerant TCP that does not affect the software running on a 
client, does not cause the server's TCP implementation to be changed, and 
does not use a proxy. See Alvisi, page 2, third paragraph. 

The Office states that the teachings of Lu demonstrate that wrappers were 
well known in the art at the time of the present invention, and that, in view of the 
combination of Ma and Lu, it would have been obvious to a person skilled in the 
art to disclose the proxies as wrappers to advantageously provide a means of 
making server failures transparent to a client while providing layers that would 
prevent modification of the connection-oriented protocol. Applicant 
respectfully disagrees with this statement and directs the Examiner's attention to 
the cited publication, which distinguishes the use of proxies from the use of 
wrappers. 

The aforementioned references are devoid of any teaching or suggestion 
of how to restore a state of connection associated with a connection-oriented 
layer prior to failure through the use of wrapping layers. There is simply no 
discussion in either Ma or Lu of restoring connections with a server in a manner 
as recited in Claim 30. Thus, the cited references, whether taken singly or in 
combination, do not teach or suggest the method of Claim 30 and for the same 
reason fail to teach the system of independent Claim 37. Accordingly, there 
would be no motivation to combine Ma and Lu to arrive at Claims 30 and 37. 

For all the reasons described above, the combination fails to teach or 
suggest independent Claims 30 or 37. 

Claims 31-36 depend from Claim 30 and are allowable by virtue of this 
dependency. Additionally, these claims recite additional features that, when 
taken together with those of Claim 30, define methods that are not taught or 
suggested by the Ma and Lu combination. 

Conclusion 

Pending Claims 30-37 are in condition for allowance. Applicant 
respectfully requests reconsideration and issuance of the subject application. If 
any issues remain that preclude issuance of this application, the Examiner is 
urged to contact the undersigned attorney before issuing a subsequent Action. 



Respectfully submitted, 
WERNER & AXENFELD, PC 



Dated: [illegible] 
PO Box 1629 
West Chester, PA 19380 
610-701-5810 

Wrapping Server-Side TCP to Mask Connection Failures 

Lorenzo Alvisi, Thomas C. Bressoud, Ayman El-Khashab, Keith Marzullo, Dmitrii Zagorodnov 



Abstract — We present an implementation of a fault- 
tolerant TCP (FT-TCP) that allows a faulty server to keep its 
TCP connections open until it either recovers or it is failed 
over to a backup. The failure and recovery of the server 
process are completely transparent to client processes 
connected with it via TCP. FT-TCP does not affect the software 
running on a client, does not require changes to the server's 
TCP implementation, and does not use a proxy. 

Keywords — Fault-tolerance, TCP, Rollback-Recovery. 

I. Introduction 

When processes on different processors communicate, 
they most often do so using TCP. TCP, which 
provides a bi-directional byte stream, is used for short 
lived sessions like those commonly used with HTTP, for 
long lived sessions like large file transfers, and for contin- 
uous sessions like those used with BGP [8]. TCP is ex- 
ceptionally well engineered; literally man-millennia have 
gone into the design, implementation, performance tuning 
and enhancement of TCP. 

Consider a TCP session set up between two processes, 
one of which is a client and the other a server. For our 
purposes, the difference between the two is a question of 
deployment and control: an organization is responsible for 
the server, but clients are associated with individuals that 
may not be part of that organization.^1 This system is 
distributed, and can suffer from the failure of the client or 
the server. The failure of the client is out of the control 
of the organization, but the failure of the server is not. If 
the server provides some service from which the organization 
earns money, then recovery of the server is important. 
Recovering the state of the server's application can be 
done by checkpointing the application's state and restarting 
from this checkpoint on either the same or a different 
processor. But, the TCP session will need to be recovered 
as well. We assume that rewriting the client and server 
applications to detect TCP session loss and to reestablish the 
session as necessary is not a feasible approach because of 
the cost and danger of introducing bugs into the application. 

^1 With our terminology a server can do both an active and a passive 
open. 

One approach to recovering the TCP session is to insert 
a layer of software between the TCP layer and the applica- 
tion layer of the client, and to insert a similar layer between 
the TCP layer and the application layer of the server. Such 
a layer provides the TCP abstractions as well as recovers 
a lost TCP session. To do the latter, this layer implements 
some kind of checkpointing of the TCP connection state. 
It also re-establishes the connection between the old client 
and the new server in a state consistent with the state of 
the connection when the old server crashed. This is not 
a trivial layer of software to develop, especially when the 
failure-free performance of the connection is an issue [7]. 
The primary drawback of this approach, though, is that the 
required layer of software must be run by both the server 
and the clients. For some applications this is not a prob- 
lem, but in general clients are out of the control of the or- 
ganization that maintains the servers. 

A second approach is to redesign the TCP layer on 
the server to add support for checkpointing and restarting 
to the TCP implementation. This redesigned TCP layer 
checkpoints the state of the connection so that the new 
server sends packets consistent with the state of the con- 
nection when the old server crashed. For example, the 
TCP protocol requires each end of a connection to choose 
fresh initial sequence numbers when opening a connec- 
tion. The redesigned TCP layer would need to re-open the 
connection using the old initial sequence numbers. This 
redesigned TCP layer is also not trivial to implement, es- 
pecially when failure-free performance and security of the 
connection are important. The primary drawback with this 
approach, though, is that it requires a new TCP implemen- 
tation on the server side. At best, the support would be 
retrofitted into an existing implementation, and so would 
need to be re-retrofitted whenever the TCP implementation 
is improved or enhanced. Such repeated retrofitting 
exacts a high development and maintenance cost. 

A third approach is to leave the client and server code 
untouched, and to redirect all TCP traffic between them 
through a proxy. This approach generalizes the work 
described in [5] that uses such a proxy for mobile computing. 
The proxy maintains the state of the connection between 
the client and the server. If the server crashes, then 
the proxy switches the connection to an alternate server 
and ensures that the new connection is consistent with the 
client's state. Like the second approach, this approach 
needs to ensure the sequence numbers of the new connection 
are consistent with those of the old connection. 
The main drawback of this approach is that it introduces 
a new single point of failure, namely the proxy. This single 
point of failure is especially troublesome when a proxy 
is shared among many servers. And, when the proxy fails, 
the TCP connection between the client and the proxy needs 
to be made fault-tolerant, which raises the original problem 
again. 

The approach that we develop in this paper does not suf- 
fer from the drawbacks of the previous approaches. We 
present an implementation of a fault-tolerant TCP that 
does not affect the software running on a client, does not 
cause the server's TCP implementation to be changed, and 
does not use a proxy. With Fault-Tolerant TCP (FT-TCP) 
a faulty process can keep its TCP connections open until 
it either recovers or it is failed over to a backup. No client 
process connected with a crashed process running FT-TCP 
can detect any anomaly in the behavior of their TCP con- 
nections: the failure and recovery of the crashed process 
are completely transparent. We show that for some reason- 
able network configurations, our approach has a negligible 
impact on both throughput and latency. 

Although the details of the solution that we outline 
are specific to TCP, the architecture that we propose is 
sufficiently general to be applicable in principle to other 
connection-oriented network protocols. 

The remainder of this paper is organized as follows. In 
Section II we give an overview of the major architectural 
components of FT-TCP. This is followed by a discussion 
of the recovery of logged data in Section III. In Section IV 
we describe the FT-TCP protocol, and its operation during 
failure-free executions and recovery. Section V presents an 
empirical evaluation of the performance of FT-TCP. Section VI 
discusses further implementation issues, and Section VII 
concludes the paper. 

Because of lack of space, we omit reviewing TCP. We 
assume that the reader is familiar with how connections are 
opened and closed, how TCP represents the sliding window, 
and how flow control is implemented. 

II. Architecture 

FT-TCP is based on the concept of wrapping, in which 
a layer of software surrounds the TCP layer and intercepts 
all communication with that layer. The communication 
can come either from the IP layer upon which the TCP 
layer is built (the corresponding wrapper is called the south 
side wrap, or SSW for short) or from the application that 
uses the TCP layer to read and write data (the corresponding 
layer is called the north side wrap, or NSW for short). 
These two wraps in turn communicate with a logger. The 
resulting architecture is illustrated in Figure 1. Together, 
these three components maintain at the logger the current 
state of the TCP connection. And, if the current TCP con- 
nection goes down, they cooperate to restart the server, to 
restore the state of the TCP connection to what it had be- 
fore the crash, and to map the TCP sequence numbers used 
with the client state to the TCP sequence numbers used 
with the current server state and vice versa. 

Given a TCP connection between a client and a server, 
we call the byte stream from the client to the server the 
instream and the byte stream from the server to the client 
the outstream. 

[Figure: Application, North Side Wrap, Logger, South Side Wrap] 

Fig. 1. FT-TCP architecture. 

SSW intercepts data passing between the TCP layer and 
the IP layer. For segments coming from the TCP layer 
to the IP layer, SSW maps the sequence number from the 
client's connection state to the server's current connection 
state. It does so to allow a recovering server's TCP to 
propose a new initial sequence number during the three- 
way handshake at connection establishment; SSW trans- 
lates the sequence numbers to be consistent with those 
used in the original handshake. For packets going from 
the IP layer to the TCP layer, SSW performs the inverse 
mapping on the ACK number. SSW also sends packets to 
the logger and either modifies or generates acks coming 
from the server to the client. It does so to ensure that the 
client side discards data from its send buffer only after it 
has been logged at the logger. 

NSW intercepts read and write socket calls from the 
application to the TCP layer. During normal operation, 
NSW logs the amount of data that is returned with each 
read socket call. We call this value the read length for 
that socket call. When a crashed server is recovered, NSW 
forces read socket calls to have the same data and read 
lengths. It does so to ensure deterministic recovery, as we 
discuss in Section III. During recovery NSW also discards 
write socket calls to avoid resending data to the client. 

The logger runs on a processor that fails independently 
from the server. It stores information for recovery pur- 
poses. In particular, it logs the connection state informa- 
tion (such as the advertized window size and the acknowl- 
edgment sequence number), the data, and the read lengths. 
The logger acknowledges to the north and south side wraps 
after logging data.^2 

Checkpointing the state of the application running on the 
server is outside of the scope of this paper. Hence, in this 
paper we assume that a restarting server has the application 
restart from its initial state. We assume that the process 
issues the same sequence of read socket calls when 
replayed as long as the read length of each read socket 
call has the same value as before. We also assume that 
there is only a single TCP stream open with the server. The 
generalization of the ideas given in this paper to a server 
supporting multiple streams is not hard, but having multi- 
ple streams often implies a multithreaded server. With a 
multithreaded server, the interdigitation of stream reads by 
different threads needs to be addressed. Again, this issue 
is outside of the scope of this paper. 

Our protocol requires a mechanism that allows a process 
on another processor to take over the IP address of a pro- 
cess on a failed processor. This mechanism also updates 
the ARP cache of any client on the same physical network 
as the failed or recovered server. The latter can be done us- 
ing gratuitous ARP. See [9] for further details and related 
issues. 

III. Recovering from Logged Data 

FT-TCP recovers the state of a crashed server from the 
recovery data on the logger. Hence, it is necessary that 
the logger store the latest state of the server, namely all 
of the packets it has received and the read lengths it has 
generated. Waiting for the recovery data to be stored at 
the logger, however, incurs a prohibitively large latency, 
and so FT-TCP sends recovery data to the logger 
asynchronously. Because some recovery data may be lost when 
a crash occurs, this asynchronous logging of data can 
restore the server to a state that is earlier than the state that 
the client knows the server had attained before crashing. 

^2 We use the term ack to refer to a TCP segment that acknowledges the 
receipt of data, and use the term acknowledgement to refer to a message 
that the logger sends to the server indicating that data is logged. 

For example, consider the connection state information. 
The ACK sequence number of a packet from the server 
indicates the amount of data that the server has received from 
the client. Suppose that the server sends a packet with an 
ACK sequence number asn but the logger has stored data 
only through asn - i for i > 1. When the server recovers, 
the TCP layer knows of data only through asn - i, 
and the next packet it sends has an ACK sequence number 
less than asn. To recover the server farther, the client's 
TCP layer would need to send the missing data. Since 
the server has already acknowledged through asn, however, 
the client may have discarded this data from its send 
buffers. To avoid this problem, SSW never allows the ACK 
sequence number of an outgoing segment to be larger than 
asn - i + 1. 

A similar problem occurs with respect to the data ex- 
changed between the client and server applications. The 
state of the server application may depend on the read 
lengths the application observes. For example, suppose 
the application attempts to read 8,000 bytes. It may take 
a different action if the read socket call returns less than 
1,000 bytes as compared to it returning at least 1,000 bytes. 
It is because of such a possibility that NSW records read 
lengths. 

Suppose now that a read returns 900 bytes. NSW 
sends this read length to the logger, but the server crashes 
before the logger receives this message. Thus, the read 
length is lost. After restarting, the read is re-executed, 
but, because more data has become available in the receive 
buffers of TCP, the new read returns 1,500 bytes, bring- 
ing the server to a state inconsistent with what it had before 
the crash. If the client can observe this inconsistency — for 
example, by receiving data from the server both before it 
crashed (reflecting the state in which the server read only 
900 bytes) and after it crashed (reflecting the state in which 
the server read 1,500 bytes) then the failure of the server 
is not masked. If the only way in which the client can ob- 
serve the state of the server is by receiving data from it, 
then this problem can be solved by delaying all write 
socket calls by the server application until all prior read 
lengths are known to be stored on the logger. This is what 
FT-TCP does. 
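
The write-delay rule can be sketched as a small gate around a counter of unlogged read lengths. The counter mirrors the unstable_reads variable that the paper defines in Section IV; the class name, the method names, and the use of a condition variable are assumptions made for illustration only.

```python
import threading


class WriteGate:
    """Hypothetical gate: writes wait until every prior read length is logged."""

    def __init__(self):
        self._cond = threading.Condition()
        self.unstable_reads = 0   # read lengths sent to the logger but not yet acknowledged

    def on_read(self):
        """Called after a read's length has been sent to the logger."""
        with self._cond:
            self.unstable_reads += 1

    def on_logger_ack(self):
        """Called when the logger acknowledges one read length."""
        with self._cond:
            self.unstable_reads -= 1
            if self.unstable_reads == 0:
                self._cond.notify_all()

    def wait_until_writable(self, timeout=None):
        """Block a write until unstable_reads is zero; False means the wait timed out."""
        with self._cond:
            return self._cond.wait_for(lambda: self.unstable_reads == 0, timeout)
```

A write issued right after an unacknowledged read blocks in wait_until_writable and proceeds once on_logger_ack brings the counter back to zero.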

The problem of restoring the crashed server to a state 
consistent with its last state observed by a client is an instance 
of the more general Output Commit problem [2]. 
We discuss the Output Commit problem further in Sec- 
tion VI. 

IV. Protocol 

We now describe the operation of FT-TCP. After intro- 
ducing the state that FT-TCP maintains, we describe how 
an FT-TCP connection is set up. We then describe the op- 
eration of SSW and NSW while the TCP connection is 
open and operational, either for the first time or after a 
failed server has recovered. We call this mode normal op- 
eration. We then describe how a failed server recovers its 
TCP stream state. 

A. Variables 

FT-TCP maintains the following variables: 

• delta_seq: This variable allows SSW to map sequence 
numbers for the outstream between the server's TCP layer 
and the client's TCP layer. 

• stable_seq: This variable is the smallest sequence 
number of the instream that SSW does not know is stored 
on the logger. Note that this value can never be larger 
(ignoring 32-bit wrap) than the largest sequence number that 
the server has received from the client. This variable is 
also computed during recovery from the data stored on the 
logger. 

• server_seq: This variable is the highest sequence 
number of the outstream that SSW knows to have been 
acknowledged by the client. This variable can be computed 
during recovery from the data stored on the logger. 

• unstable_reads: This variable counts the number of 
read socket calls whose read lengths NSW does not know 
to be recorded by the logger. If unstable_reads is 
zero, then NSW knows that the logger has recorded the 
read lengths of all prior read socket calls. 

• restarting: This boolean is true while the server is 
not in normal operation. 

B. Opening the Initial Connection 

When the connection is initially established, the only 
action FT-TCP takes is to capture and log both the client's 
and the server's initial sequence numbers. SSW does not 
pass the segment that acknowledges the client's SYN to the 
server's IP layer until the logger acknowledges that these 
initial sequence numbers are logged. If it did not do this, 
the client might believe a connection is established that the 
server, after a failure and recovery, would not be aware of. 

SSW completes the initialization of FT-TCP by setting 
delta_seq to zero, stable_seq to the client's initial 
sequence number plus one, unstable_reads to zero, 
and restarting to false. 
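
As a rough illustration, the five variables and their initial values can be collected in one record. The paper does not state an initial value for server_seq, so starting it at the server's initial sequence number is an assumption here, as are the names FtTcpState and initial_state.

```python
from dataclasses import dataclass

SEQ_MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


@dataclass
class FtTcpState:
    delta_seq: int       # offset between server-side and client-side outstream numbering
    stable_seq: int      # smallest instream sequence number not known to be logged
    server_seq: int      # highest outstream sequence number acked by the client
    unstable_reads: int  # read lengths not yet acknowledged by the logger
    restarting: bool     # True while the server is not in normal operation


def initial_state(client_isn: int, server_isn: int) -> FtTcpState:
    """State after the initial handshake, following the values given in the text."""
    return FtTcpState(
        delta_seq=0,
        stable_seq=(client_isn + 1) % SEQ_MOD,
        server_seq=server_isn,   # assumption: not specified in the text
        unstable_reads=0,
        restarting=False,
    )
```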



C. Normal Operation of SSW 

During normal operation, SSW responds to three different 
events: receiving a packet from IP, receiving a segment 
from the TCP layer, and receiving an acknowledgement 
from the logger. 

When SSW receives a packet from IP, it first forwards 
the packet to the logger. SSW then subtracts delta_seq 
from the ACK number. Since doing so changes the pay- 
load, SSW recomputes the TCP checksum on the segment. 
Recomputing the checksum is not expensive: it can be 
done quickly given the checksum of the unchanged seg- 
ment, the old ACK number, and the new ACK number. 
SSW then passes the result to the TCP layer, without wait- 
ing for an acknowledgement from the logger indicating 
that the packet has been logged. 
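
One standard way to do such a quick recomputation is the incremental update for the Internet checksum given in RFC 1624, applied to the two 16-bit halves of the ACK number. The sketch below is an illustration under that assumption; the paper does not say which method FT-TCP uses, and the function names are hypothetical.

```python
def ones_complement_sum(words):
    """Add 16-bit words with end-around carry."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(data: bytes) -> int:
    """Internet checksum over the whole buffer (the slow path, for comparison)."""
    if len(data) % 2:
        data += b"\x00"
    words = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return ~ones_complement_sum(words) & 0xFFFF


def incremental_checksum(old_checksum: int, old_ack: int, new_ack: int) -> int:
    """RFC 1624 update HC' = ~(~HC + ~m + m') for each 16-bit half of the ACK field."""
    words = [~old_checksum & 0xFFFF]
    for old, new in ((old_ack >> 16, new_ack >> 16),
                     (old_ack & 0xFFFF, new_ack & 0xFFFF)):
        words.append(~old & 0xFFFF)  # remove the old half
        words.append(new)            # add the new half
    return ~ones_complement_sum(words) & 0xFFFF
```

The incremental form touches five 16-bit words regardless of the segment length, which is why the recomputation is cheap.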

When SSW receives an acknowledgement from the logger 
for a packet, SSW updates stable_seq if necessary. 
Specifically, if the acknowledgement is for a packet that 
carries client data with sequence numbers from sn through 
sn + i, then stable_seq is set to the larger of its current 
value and sn + i + 1. 

When SSW receives a segment from the TCP layer, it 
remaps the sequence number by adding delta_seq to it. 
SSW then sets the ACK number to stable_seq. Since 
stable_seq never exceeds an ACK number generated 
by the TCP layer, modifying the ACK number may result 
in an effective reduction of the window size advertised 
by the server. For example, suppose that the segment 
from the TCP layer has an ACK number of asn and 
an advertised window of w. This means that the server's 
TCP layer has sufficient buffering available to hold client 
data up through sequence number asn + w - 1. By setting 
the ACK number to stable_seq, the SSW effectively reduces 
the buffering for client data by asn - stable_seq. 
To compensate, SSW increases the advertised window by 
asn - stable_seq. Again, after modifying the TCP segment, 
the TCP checksum must be recomputed. Finally, the 
TCP segment is passed to IP. 

Figure 2 illustrates normal operation for SSW as de- 
scribed above. 
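
The three SSW event handlers described in this subsection can be sketched as follows. This is a simplified model, not the authors' code: it handles only the header fields discussed, omits logging and checksum updates, ignores 32-bit wrap in the stable_seq comparison (as the text does), and does not clamp the enlarged window to the width of the TCP window field. The update to sn + i + 1 follows from the definition of stable_seq as the smallest sequence number not known to be logged. All names are hypothetical.

```python
SEQ_MOD = 2**32  # sequence and ACK numbers are 32-bit


class SouthSideWrap:
    """Hypothetical model of SSW header rewriting during normal operation."""

    def __init__(self, delta_seq: int, stable_seq: int):
        self.delta_seq = delta_seq
        self.stable_seq = stable_seq

    def from_ip(self, seq: int, ack: int):
        """Packet from the client: subtract delta_seq from the ACK number."""
        return seq, (ack - self.delta_seq) % SEQ_MOD

    def on_logger_ack(self, sn: int, i: int):
        """Logger confirmed that client bytes sn through sn + i are stored."""
        self.stable_seq = max(self.stable_seq, sn + i + 1)

    def from_tcp(self, seq: int, asn: int, window: int):
        """Segment from the server's TCP layer: remap seq, clamp ACK, widen window."""
        gap = (asn - self.stable_seq) % SEQ_MOD   # bytes received but not yet logged
        return (seq + self.delta_seq) % SEQ_MOD, self.stable_seq, window + gap
```

With delta_seq = 1000 and stable_seq = 600, a server segment with sequence number 5000, ACK number 700, and window 4000 leaves SSW with sequence number 6000, ACK number 600, and window 4100, so the client may still send up through 4699 = asn + w - 1.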

D. Normal Operation of NSW 

During normal operation, NSW responds to three different 
events: a read socket call, a write socket call, and 
receiving an acknowledgement from the logger. 

For each read, NSW sends the resulting read length to 
the logger and increments unstable_reads. When NSW 
receives the acknowledgement from the logger, it 
decrements unstable_reads. And, for each write, NSW 
blocks the call until unstable_reads is zero. 

Fig. 2. Normal operation of SSW. 



E. Re-establishing a Connection 

The following steps occur when the server crashes: 

1. The logger detects the failure of the server. It temporar- 
ily takes over the role of the server by responding to in- 
stream packets with a TCP segment. This segment has a 
closed window and acks the data received from the client 
up to the last value the logger has for stable_seq. 

2. The server restarts and FT-TCP reconnects with the log
server. FT-TCP sets stable_seq and server_seq to
the values computed by the logger based on the logged
data; unstable_reads is set to zero and recovering
is set to true.

Once the server obtains and acknowledges this information 
from the log server, the logger implicitly relinquishes the 
generation of closed window acknowledgements to SSW. 
SSW continues to generate periodically these acknowl- 
edgements as long as recovering is true. 

3. The restarting application executes either an accept
or connect socket call. We describe here only what happens
if accept is called; connect is handled similarly.
NSW has SSW fabricate a SYN that appears to come from
the client. This SYN has an initial sequence number of
stable_seq. Thus, the server's TCP layer will start
accepting client data with sequence number stable_seq
plus one. SSW passes this SYN to the TCP layer.

4. The acknowledging SYN generated by the server's TCP
layer is captured by SSW. SSW sets delta_seq to the
logged initial server sequence number minus the new initial
sequence number proposed by the server's TCP layer. SSW
discards this segment, fabricates the corresponding ack,
and passes the ack to the server's TCP layer.

5. The server's application starts running as part of the
restart of the server. Every read socket call it executes is
captured by NSW, which supplies the corresponding data
from the logger. The amount of data returned with each
read is determined by the corresponding logged read
length.

Every write socket call is also captured by NSW. NSW
keeps a running total of the number of bytes that have been
written by the server since starting recovery. As long as
each write produces data that was written before (as
determined from the logged initial client sequence number
and server_seq), NSW discards the write and returns a
successful write completion to the application.
Once the last logged read is replayed and the server's 
application has written all of the bytes it had written be- 
fore, NSW sets recovering to false, and the server 
resumes the normal mode of operation. This may cause 
a read socket call to block until all the replayed writes 
have occurred. It may also cause data to be rewritten to 
the outstream after recovering is true. In the latter 
case, the resulting TCP segments will be discarded by the 
client's TCP layer as being delayed duplicates. 
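
As a worked example of the arithmetic in step 4, the following sketch computes delta_seq modulo 2^32 and shows that adding it to a sequence number chosen by the restarted TCP layer gives the number the client expects. The function names are hypothetical and the initial sequence numbers are made-up values chosen to exercise the 32-bit wrap.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit


def compute_delta_seq(logged_isn: int, new_isn: int) -> int:
    """Logged initial server sequence number minus the newly proposed one."""
    return (logged_isn - new_isn) % MOD


def remap(seq: int, delta_seq: int) -> int:
    """Translate a sequence number from the restarted TCP layer's numbering
    into the numbering the client has seen since the original connection."""
    return (seq + delta_seq) % MOD


logged_isn = 100            # assumed value recorded by the logger
new_isn = 4_294_967_000     # assumed value chosen after the restart
delta_seq = compute_delta_seq(logged_isn, new_isn)

# The new initial sequence number maps back onto the logged one, and an
# offset of 10 bytes is preserved by the remapping.
assert remap(new_isn, delta_seq) == logged_isn
assert remap((new_isn + 10) % MOD, delta_seq) == logged_isn + 10
```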

F. Ack Strategies

As described in Section IV-C, SSW modifies acks in the
outstream to ensure that the client does not discard in- 
stream data before the SSW knows it is logged on the log- 
ger. It does so by ensuring that the ack sequence number is 
never larger than stable_seq. All outstream segments 
are immediately processed and passed to IP. Further, no 
additional segments are generated by SSW. We call this 
the Basic ack strategy. By itself, Basic is not a satisfactory 
strategy. 

To make this concrete, assume that a segment S arrives
at SSW in the instream carrying bytes starting with
sequence number sn. SSW sends this data to the logger,
but by the time the server's TCP layer generates an ack for
it, the logger has not yet acknowledged it, meaning that
stable_seq is still less than sn. Even if the
acknowledgement from the logger arrives immediately thereafter,
the client's TCP layer will not become aware of it until the
server's TCP layer sends a subsequent segment.

Such a situation inhibits the client's ability to measure
the round trip time (RTT). Worse occurs, however,
when the outstream traffic is low and the instream traffic is
blocked due to windowing restrictions. For example,
consider what happens when slow start [4] is in effect.
Suppose that the client sends two segments S1 and S2 when
the client's congestion window is two segments in size and
is less than the server's advertised window. If the acks to
these packets are generated before either is logged on the
logger, then the client will block with a filled congestion
window, and the server will block starved for data. This
situation will persist until the client's TCP layer
retransmits S1 and S2.

Two simple ack strategies that avoid such problems are:

Lazy: SSW uses the Basic ack strategy for outstream segments
that carry data. Segments that carry no data (and
hence are only acking the delivery of instream data) are
instead held by SSW until their ack sequence number is at
least stable_seq.

Eager: SSW uses the Basic ack strategy. In addition, SSW 
generates an ack on the outstream for every acknowledge- 
ment it receives from the logger, thus acking every in- 
stream packet. 

The Eager strategy can significantly increase bandwidth
demand. A fourth ack strategy, which we call Conditional,
addresses this drawback. It is a variant of Eager for which
only some acknowledgements from the logger cause SSW
to generate an ack. Specifically, consider the point at which
SSW receives an acknowledgement from the logger for a packet
S. SSW generates an ack iff TCP has attempted an ack for S
and, during the interval from the arrival of S at SSW to the
arrival of the acknowledgement of S, SSW receives no
packets from the IP layer.
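
The Conditional rule can be expressed as a small decision procedure. The sketch below is one possible reading, not the paper's implementation: the class ConditionalAck, its method names, and the use of packet identifiers are hypothetical, and it assumes every logger acknowledgement refers to a packet SSW has already seen.

```python
class ConditionalAck:
    """Decide whether a logger acknowledgement should trigger an ack."""

    def __init__(self) -> None:
        self._arrivals = 0         # instream packets seen from IP so far
        self._arrival_index = {}   # packet id -> value of _arrivals on arrival
        self._tcp_attempted = set()  # packets TCP has tried to ack

    def on_packet_from_ip(self, pkt_id) -> None:
        self._arrivals += 1
        self._arrival_index[pkt_id] = self._arrivals

    def on_tcp_ack_attempt(self, pkt_id) -> None:
        self._tcp_attempted.add(pkt_id)

    def on_logger_ack(self, pkt_id) -> bool:
        """Return True iff SSW should generate an ack for pkt_id."""
        # No later packet arrived if the arrival counter has not moved.
        quiet = self._arrivals == self._arrival_index.pop(pkt_id)
        attempted = pkt_id in self._tcp_attempted
        self._tcp_attempted.discard(pkt_id)
        return attempted and quiet
```

If further instream packets arrived in the meantime, the rule declines to generate an extra ack, which is where the saving over Eager comes from.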

The performance of FT-TCP using only Basic is quite 
poor; we have found that the resulting FT-TCP is often un- 
able to sustain a bulk-transfer connection. We discuss the 
performance of the other three ack strategies in Section V. 

G. Logger 

The server and the logger communicate using TCP. The 
server sends a segment to the logger for every segment it 
receives from the client. Each such segment generates an 
acknowledgement from the logger, which is simply a 32- 
bit integer. This difference in the amount of traffic raises 
the question of whether Nagle's algorithm [6] should be 
enabled for the stream from the logger to the server. En- 
abling Nagle results in batching multiple acknowledge- 
ments into a single segment and therefore reduces the load 
on the network connecting the server and the logger. If
this network is the same one that the client is on, then the
extra packet overhead incurred using TCP with Nagle
disabled may significantly reduce the bandwidth of the
FT-TCP connection. We explore this question in Section V.

Each acknowledgement from the logger is simply a
sequence number: it is the lowest sequence number of client
data that is not logged. Thus, the sequence of
acknowledgements is monotonically increasing (ignoring the 32 bit
wrap). This means that the last acknowledgement in any
batch contained in a segment is the only one that needs
to be processed by SSW, since it dominates the other
acknowledgements. We have found, though, that the overhead
incurred by having SSW process each acknowledgement
is small enough that it is not worth taking advantage
of this observation. Note that with the Conditional
ack strategy, the only acknowledgement in such a batch
that could cause SSW to generate an ack (as discussed in
Section IV-F) is the last one. Hence, the effect of processing
the other acknowledgements is just to increase
stable_seq.
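
To illustrate the observation that the last acknowledgement in a batch dominates the others, the sketch below unpacks a segment payload made of consecutive 32-bit integers and returns only the final one. The function name is hypothetical, and network byte order is an assumption; the paper does not state the wire format beyond each acknowledgement being a 32-bit integer.

```python
import struct


def last_ack_in_batch(payload: bytes) -> int:
    """Return the final 32-bit acknowledgement carried in a segment."""
    count = len(payload) // 4
    if count == 0:
        raise ValueError("segment carries no complete acknowledgement")
    # "!" selects network (big-endian) byte order; "I" is an unsigned 32-bit int.
    acks = struct.unpack(f"!{count}I", payload[: count * 4])
    return acks[-1]


# Three batched acknowledgements; only the last needs to update stable_seq.
batch = struct.pack("!3I", 1000, 2460, 3920)
newest = last_ack_in_batch(batch)
```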

V. Performance 

In this section, we first describe the metrics of interest 
and the experimental setup. We then present the results of 
our experiments. 

A. Goals and Experiments 

To show that FT-TCP is viable in practice, we evaluated 
a prototype implementation of FT-TCP. Specifically, we 
used an application in which the client transmits a stream, 
as bulk data, to the server, as fast as it can. The server 
simply discards this data. We measured: 

1. The throughput of FT-TCP as compared to the throughput
for the same bulk data transfer of an unwrapped TCP
layer at the server.

2. The additional latency introduced by FT-TCP. 

3. The recovery time of the server, divided into its con- 
stituent parts. 

We chose bulk transfer from the client to the server for 
two reasons. First, by simple inspection of the protocol, 
it is clear that the outstream incurs a much smaller over- 
head when compared with the instream. Second, bulk data 
transfer with no server computation is the most
disadvantageous workload for FT-TCP; we expect that with other
workload types, FT-TCP would compare more favorably
with TCP.

We ran our experiments using three separate machines, 
one each for the client, the server, and the logger. On the 
server, we implemented Linux kernel modules for NSW 
and SSW, a Unix application providing communication 
between the kernel modules and the logger, and the server 
application. We implemented corresponding applications
on the client and the logger.

Our server was a 450 MHz Pentium II workstation with
512 KB cache and 128 MB of memory, while both the
client and the logger ran on 300 MHz Pentium IIs with
512 KB cache and 64 MB of memory. The server allowed
connection to the client and the logger via two separate
100 Mbps Ethernet adaptors (both Intel EtherExpress Pro
100). The client used a 10 Mbps 3Com Megahertz PCMCIA
Ethernet card and the logger used a 100 Mbps 3Com
PCI Etherlink XL. FT-TCP wrappers were implemented
for Linux kernel 2.0.36, while Linux 2.2.x kernels were
run on the client and logger machines.

The tcpdump utility was used to collect timestamps 
and packet information for the connections under test. 
This was executed on the client machine to get accurate 
client-side measurements of latency of FT-TCP connec- 
tions. 

To measure throughput and latency, we ran our experi- 
ments with three different network configurations: 
1. The client and server share one 10 Mbps Ethernet and the
server and logger share another 10 Mbps Ethernet. We call
this configuration 10-10.

2. The client, server, and logger are all on the same 10 Mbps
Ethernet. We call this configuration 10 Shared.

3. The client and server share one 10 Mbps Ethernet and the
server and the logger share a 100 Mbps Ethernet. We call
this configuration 10-100.

The first configuration models a simple network setup. 
The second configuration allows us to determine the im- 
pact of having all three system components share the same 
network. The third configuration allows us to remove the 
network bandwidth between the server and the logger as 
the (expected) bottleneck. 

To measure recovery time, we timed how long it takes 
to recover the server from the data on the logger. Since 
the client does not participate in recovery, for these exper- 
iments we considered only the two configurations 10-10 
and 10-100. 

B. Results 

The results below are for a 1MB transfer from client 
to server. We first gathered the results of a non-wrapped 
TCP stack at the server, with the same client and server 
applications as used for the experimental runs. 

For each network configuration and ack strategy, we 
gathered results from 12 runs. To measure throughput, 
we applied a simple linear least squares fit with the inde- 
pendent variable being the beginning sequence number of 
a segment and the dependent variable being the time this 
segment was received by SSW. The coefficient of
determination for all but one of these fits is 0.99 or better;
for the remaining one (Lazy for 10 Shared) it is 0.83. The
slope of a least squares fit provides an accurate represen- 
tation of the throughput of the connection. We present the 
error bounds of these slopes for a 95% confidence interval. 
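
The fit described above can be reproduced with a short least squares routine. This is a sketch under one assumption worth stating: with the sequence number as the independent variable and the arrival time as the dependent variable, the slope has units of seconds per byte, so the sketch reports its reciprocal as the throughput. The function name and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # coefficient of determination
    return slope, intercept, r_squared


# Made-up trace: 1000-byte segments arriving one millisecond apart.
seqs = [0, 1000, 2000, 3000]
times = [0.000, 0.001, 0.002, 0.003]
slope, intercept, r2 = fit_line(seqs, times)
throughput_bytes_per_s = 1.0 / slope
```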

We measured latency by post-processing the tcpdump 
results to determine the interval from the time data in the 
instream was sent by the client to the time the ack for that 
data was received by the client. We averaged these inter- 
vals and calculated 95% confidence intervals. 
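
One common way to obtain such an interval is the normal approximation, sketched below. The paper does not say which method was used, so the 1.96 multiplier and the function name are assumptions, and the sample latencies are made up.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Sample standard deviation divided by sqrt(n) is the standard error.
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width


# Hypothetical send-to-ack intervals in milliseconds.
latencies_ms = [5.0, 6.0, 7.0]
mean_ms, bound_ms = mean_with_ci95(latencies_ms)
```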



B.1 10-10

Table I presents the results for the 10-10 configuration,
giving throughput, average latency, and ack count for an
unwrapped TCP (Clean) and for each ack strategy.

TABLE I
10-10 Performance.

          Throughput  Error Bound  % of     Avg. Latency  Error Bound  Ack
          (KB/s)      (KB/s)       Clean    (ms)          (ms)         Count
Clean     1007.69     0.53         100.00%  5.82          3.39         3064
Lazy      241.33      0.76         23.95%   27.75         17.23        1766
Lazy64k   488.08      3.26         48.44%   44.89         29.75        555
Eager     722.84      1.16         71.73%   56.70         35.77        12810
Cond.     651.02      2.50         64.61%   53.96         58.46        3276


From the table, we first note that the throughput of the
Lazy ack strategy is only 24% of that of unwrapped TCP.
The explanation for this result is obtained through comparison
with unwrapped TCP as observed through the tcpdump
logs of the runs.

Under unwrapped TCP, the server application is at
least as fast as the client. Good bandwidth utilization
is achieved through a well-formed interleaving of the
instream data packets within the advertised window with the
sequence of acks returning to the client. To illustrate by
example, say that the advertised window has a capacity of
six packets. At some point in the steady state of the
transfer, the client sends segments x, x + 1, x + 2. At this point
in the interleaving, the server sends an ack for the bytes
in x - 1, which allows the client to send packets x + 3,
x + 4, x + 5, and then receives the ack for the bytes in
x + 2. This pattern then repeats. Under this interleaving,
the client is rarely stalled awaiting an ack from the server
to allow more data to be sent.

Under FT-TCP, the Lazy ack strategy exhibits a pattern 
in which the client sends all the data possible in the win- 
dow and then stalls for an acknowledgment. This ack is 
only sent after the ack field has been acknowledged by 
the logger. This pattern of behavior is indicative of a fast 
sender and a slow receiver. Note the reduced number of 
acknowledgments over the 12 runs. With the additional 
latencies, the processing of FT-TCP, and the communica- 
tion between server and logger, the server application is no 
longer consuming data as fast as the client is sending it. 

When we tried to address the problem by increasing the
receive buffer size to 64K bytes (indicated in table entry
Lazy64k), we found that the pattern of a burst of data
followed by acks persisted, but its effect was reduced at a
cost of increased average latency^. The Eager ack strategy
generates acks more aggressively and so it helps break
the fast-sender-and-slow-receiver pattern. This comes at a
cost of increased bandwidth for the additional acks. The
Conditional ack strategy also helps break the pattern, but
does better than Eager since it does not generate as many
acks.

^The average latency is indicative of the size of the window and the
number of unacknowledged packets in the window. The standard
deviation also increases since a single acknowledgment acks data from
multiple packets in the window.

B.2 10 Shared

In this network configuration, the same 10 Mbps network
segment is utilized both by the client to server
communication and by the server to logger communication.
Because of this, at least twice as many data bits are
transmitted across the shared medium. Since the client and
unwrapped TCP server are close to saturating the 10 Mbps
link, FT-TCP can only be expected to perform no better
than 50% of clean TCP. In addition, there is increased
contention for the CSMA/CD physical link and the
corresponding backoffs.

The results in Table II show that Eager and Conditional
provide approximately a third of unwrapped TCP
performance. Conditional achieves this by sending 3,500 acks,
while Eager sends almost 13,000 acks. These additional
acks take up additional bandwidth and degrade
performance slightly. In this configuration, Lazy suffers from
the same performance-draining burst interleaving that it
did with the previous network configuration. And, as with
10-10, when we increased the receive buffer size, the same
pattern developed, but with reduced effects.

TABLE II
10 Shared Performance.

          Throughput  Error Bound  % of     Avg. Latency  Error Bound  Ack
          (KB/s)      (KB/s)       Clean    (ms)          (ms)         Count
Clean     1007.69     0.53         100.00%  5.82          3.39         3064
Lazy      195.17      3.79         19.37%   29.15         20.75        2100
Lazy64k   321.20      1.39         31.88%   72.06         43.92        559
Eager     338.50      0.60         33.59%   89.35         108.13       12946
Cond.     343.41      0.73         34.08%   89.47         92.73        3515

Under the Lazy ack strategy, the error bound on
throughput is noticeably worse. Further, recall from the
beginning of Section V-B that the coefficient of determination
R^2 on a linear least squares fit for Lazy with 10 Shared
was 0.83. The cause for the relatively large residual
variance is that the set of 12 runs exhibits a bimodal pattern:
some of the runs complete the 1 MB transfer in just over
3 seconds while the others take over 5 seconds. Figure 3
illustrates this bimodality, where a representative efficient
Lazy run is labeled Lazy Good and a representative
inefficient run is labeled Lazy Bad. For both completeness and
comparison, the figure also includes representative runs for
unwrapped TCP, Eager, and Conditional. The Lazy Bad
run exhibits the pattern of fast sender, slow server as
discussed in the previous section.



Fig. 3. 10 Shared Acks vs. Time. (Horizontal axis: relative time in seconds.)

A closer look at these runs illustrates the ack pattern and
ack frequency of each of the strategies. Figure 4 presents
the stream relative acks as a function of time for the first
200 ms of the same sample runs shown in Figure 3. For all
of the runs, the first 20 ms are consumed in getting from
the point of connection establishment to the first instream
data ack. Clean shows a nearly ideal pattern of acks. For
Eager, many of the acks are redundant and come in bursts,
resulting in the staircase pattern of the curve. Conditional
moderates this effect with fewer acks, and, for Lazy Bad,
the acks are very infrequent.




Fig. 4. 10 Shared Acks vs. Time for a Run Prefix. (Horizontal axis: relative time in seconds.)

B.3 10-100

As with 10-10, this network configuration separates the
network segments used for client to server communication
and server to logger communication. Further, the server
to logger communication is over a faster link. The effect
of the faster back-end link is to reduce the incremental
latency of sending data from the server to logger and
receiving the acknowledgment from 2-3 ms in the 10 Mbps case
to 0.5-0.7 ms.

As Table III shows, reducing the latency makes a strik- 
ing difference in performance. This seemingly small dif- 
ference takes us under a threshold so that the receiver 
(which includes the server application as well as FT-TCP 
components supporting communication to the logger) is 
able to keep up with the client. 

TABLE III
10-100 Performance.

          Throughput  Error Bound  % of     Avg. Latency  Error Bound  Ack
          (KB/s)      (KB/s)       Clean    (ms)          (ms)         Count
Clean     1007.69     0.53         100.00%  5.82          3.39         3064
Lazy      1007.56     0.57         99.99%   5.86          3.14         3082
Lazy64k   929.92      3.91         92.28%   17.48         22.84        1267
Eager     969.84      0.31         96.24%   5.31          3.23         11687
Cond.     985.84      0.38         97.83%   6.03          3.30         5833


For Eager and Conditional, we see the same behavior as 
before: Eager spends bandwidth to achieve a level of per- 
formance that Conditional exceeds by keeping the client 
well-acknowledged with less than half the number of acks. 
The clear winner here, however, is Lazy. With such a small 
additional latency arising from the logger, Lazy sends acks 
back to the client with the same interleaving pattern with 
respect to data packets in the window as clean TCP. In do- 
ing so, it achieves a performance that is statistically indis- 
tinguishable from clean TCP in both throughput and la- 
tency. 

B.4 Impact of Nagle Algorithm 

The measurements given above were all made with the 
Nagle algorithm enabled on the connection from the logger 
to the server. The effect on throughput of having the Nagle 
algorithm disabled on the connection is complex. Table IV 
summarizes our measurements on this effect. First, as one 
would expect, the Nagle algorithm has little effect in the 
10-100 configuration; the latency is low enough that small 
segments are rarely delayed. Hence, the number of segments
from the logger to the server changes little when
Nagle is disabled.

For the other two network configurations, the effect of
disabling Nagle can be significant. With Lazy, the additional
acknowledgements from the logger cause acks to be
more frequently generated, which breaks down the pattern
of a fast sender and slow server. In fact, with the Nagle
algorithm disabled, Lazy goes from being the worst to the
best ack strategy for the 10 Shared network configuration
in terms of throughput. The impact of disabling Nagle is
lost, however, when the server TCP buffering is increased
(Lazy64k); the benefit of increased server buffering seems
to be close to that of disabling Nagle. For the other two ack
strategies, disabling Nagle has a negative effect. Some of
this decrease arises from increased contention on the
network between the server and logger. Examination of the
tcpdump logs, however, hints that more is going on than
just increased contention. This point will require deeper
research to fully understand.

TABLE IV
Logger Nagle Disabled Performance (throughput in KB/s).

Configuration  Strategy  Nagle on  Nagle off  Change
10-100         Clean     1007.69   1007.69    0.0%
10-100         Lazy      1007.56   1007.42    0.0%
10-100         Lazy64k   929.92    896.20     -3.6%
10-100         Eager     969.84    969.11     -0.1%
10-100         Cond.     985.84    987.32     0.2%
10-10          Clean     1007.69   1007.69    0.0%
10-10          Lazy      241.33    494.22     104.8%
10-10          Lazy64k   488.08    501.08     2.7%
10-10          Eager     722.84    514.72     -28.8%
10-10          Cond.     651.02    506.92     -22.1%
10 Shared      Clean     1007.69   1007.69    0.0%
10 Shared      Lazy      195.17    368.53     88.8%
10 Shared      Lazy64k   321.20    321.52     0.1%
10 Shared      Eager     338.50    274.23     -19.0%
10 Shared      Cond.     343.41    312.72     -8.9%
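
The change column of Table IV appears to be the relative difference between the throughput with Nagle disabled and the throughput with Nagle enabled. The sketch below checks that reading against two rows of the 10-10 configuration; the function name is hypothetical.

```python
def percent_change(nagle_on: float, nagle_off: float) -> float:
    """Relative throughput change when Nagle is disabled, in percent."""
    return (nagle_off - nagle_on) / nagle_on * 100.0


# Throughput values in KB/s taken from Table IV, 10-10 configuration.
lazy_change = round(percent_change(241.33, 494.22), 1)
eager_change = round(percent_change(722.84, 514.72), 1)
```

Under this reading Lazy roughly doubles its throughput when Nagle is disabled, while Eager loses more than a quarter of its throughput.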

B.5 Recovery 

Table V summarizes the measured recovery times for 
20 runs recovering the server application with the full 1 
MB of data logged by FT-TCP. We measured both the time 
from start of recovery to end of recovery and the time re- 
quired just to play back the data through NSW. From these 
measurements, we obtained both the restart time, which 
includes process restart as well as simulating the SYN se- 
quence to the server TCP, and the replay time for the set of 
reads encountered by NSW. 

TABLE V
Recovery Measurements.

         Restart  Error  Replay  Error    Avg. Read  Error
         Time     Bound  Time    Bound    Time       Bound
         (ms)     (ms)   (sec)   (sec)    (ms)       (ms)
10-10    17.31    0.02   2.571   0.00201  1.893      0.0015
10-100   22.21    13.31  0.485   0.00036  0.338      0.0003
Local    22.44    14.08  0.056   0.00025  0.039      [illegible]



As a reference point, the table also includes recovery
time from a logger co-located on the server's physical
machine (Local), thus showing the performance if the
back-end network latency were totally eliminated.

Restart time for all three cases is around 20 ms. For
both 10-100 and Local, there was one outlier that took
almost a half second to start the process, explaining the
higher mean and error bound.

Local gives a lower bound for replay of the 1 MB in- 
stream at 56 ms. The current recovery implementation 
uses a synchronous interface between NSW and the log- 
ger on each encountered read call. This serialization of 
recovery data from the logger to NSW explains the low ef- 
fective throughput relative to the bandwidth of the 10-10 
and 10-100 cases. A simple optimization would employ 
read-ahead by NSW on recovery to achieve much better 
results. 
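
To put the replay times in perspective, the following arithmetic converts them into an effective replay throughput. It assumes that the 1 MB instream is 1024 KB, which the paper does not state; the replay times are the values from Table V and the function name is hypothetical.

```python
TRANSFER_KB = 1024  # assumption: "1 MB" means 1024 KB

# Replay times in seconds, as reported in Table V.
replay_seconds = {"10-10": 2.571, "10-100": 0.485, "Local": 0.056}


def effective_kb_per_s(seconds: float) -> float:
    """Effective replay throughput for the full logged instream."""
    return TRANSFER_KB / seconds


effective = {name: effective_kb_per_s(t) for name, t in replay_seconds.items()}
```

Under that assumption the 10-10 replay runs at roughly 400 KB/s, well below the 1007.69 KB/s that unwrapped TCP sustained over a 10 Mbps link in Table I, which is consistent with the serialization effect described above.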

VI. Other Issues

In this section we discuss some additional implementa- 
tion issues. 

A. Forced Termination of a Server 

For most Unix versions, when a process is abnormally 
terminated the process exit routine closes all open TCP 
streams. For example, if a SIGKILL is sent to the server, 
then the server implicitly closes the connection before ter- 
minating. This may result in either an orderly release — 
the server's TCP starts the close handshake by sending 
a FIN — or an abortive release — the server sends a RST. 
Since we test our implementation by terminating the server 
with a UNIX signal, this implicit close has proven prob- 
lematic. 

We work around this problem by having NSW capture 
the close socket call as well and by using OS platform 
specific information to identify the close as arising from 
abnormal process termination (rather than a valid close 
by the server's application). When the close is raised by 
process cleanup, SSW discards the FIN or RST segment 
and initiates recovery. 

B. Output Commit

In Section III we gave the reasoning for blocking writes
to the TCP stream while prior read lengths are not known
to be logged. By doing so, we ensure that the client will
never know more about the latest state of the server than 
the logger knows. This approach relies on the (probably 
valid) assumption that the only communication between 
the client and the server is via the TCP stream. And, the 
assumption that read lengths are significant nondetermin- 
istic events is weak; we are aware of only contrived servers 
for which the sequence of bytes written to the outstream is 
determined by read lengths. 

A related issue has to do with consistency between the
recovered server and the environment (such as file servers,
databases, and physical actuators). A server that changes
the environment needs to be written to handle the situa- 
tion in which the server's failure makes it uncertain upon 
recovery whether or not the change occurred [3]. The typ- 
ical approach is to make all changes to the environment 
idempotent, and have the recovered server redo the change 
should it be unsure whether the change was indeed made 
before failure. However, if the server changes the environ- 
ment before the prior read lengths are stored on the logger, 
then the server may be unable to recover to the state in 
which the change occurred. Avoiding this possibility is 
called the Output Commit problem [2]. 

In FT-TCP, the Output Commit problem can be addressed
by preventing all changes to the environment while
there are prior read lengths not known to be logged.
Conceptually, this is simple to do; NSW intercepts all such
changes and blocks them in the same way it blocks writes
to the TCP stream. In practice, this requires identifying a
set of system calls that, like write, can modify the
environment. These calls would be intercepted by NSW.

C. Other Architectures 

The architecture we describe for FT-TCP — a logger run- 
ning on a processor separate from the server — is not the 
only one worth considering. For example, one could have 
the logging of packets done on the same processor as the 
server. The logging could be done asynchronously to disk 
by using existing techniques for making in-memory logs 
as reliable as disk [1]. This architecture would decrease 
the latency of the FT-TCP connection, but would probably 
not increase the bandwidth. It would also make the only 
processor that the server could restart on be the original 
processor. 

Another possible architecture implements the logger as 
a hot standby for the server. The packets sent by SSW 
would be injected into the SSW of the standby, and the 
reads of the NSW of the standby would block until the read 
length from the server arrived. Such an architecture would 
allow for a small failover time since recovery would just
have the TCP stream migrate to the hot standby. The open
protocol described in Section IV-E is essentially the protocol
that the hot standby runs to set up its TCP stream, and
the normal operation protocol for the server is unchanged.

VII. Conclusion

We have described the architecture and performance of
FT-TCP, software that wraps an existing TCP layer to
mask server failures from unmodified clients. We have
implemented a prototype of FT-TCP and find that, even
for a demanding application, it imposes a low overhead on
throughput and latency when the connection between the
server and the logger is fast.

Further experimentation is called for. For example, we 
have measured the performance of FT-TCP only for appli- 
cations with one-way bulk transfer from the client to the 
server. A study with a wider set of realistic applications 
would give a better idea of the actual overhead of FT-TCP. 
And, even with this simple kind of application, we have 
found that the properties of the communication between 
the logger and the server have a large and complex influence
on the overhead. Finally, we still need to better understand
the impact of FT-TCP on the client TCP layer. For
example, we have not yet measured the magnitude of the
impact of a server crash and recovery on the client's RTT
measurements.